103 research outputs found

    NUMA impact on network storage protocolsover high-speed raw Ethernet

    Get PDF
    Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015.Current storage trends dictate placing fast storage devices in all servers and using them as a single distributed storage system. In this converged model where storage and compute resources co-exist in the same server, the role of the network is becoming more important: network overhead is becoming a main imitation to improving storage performance. In our previous work we have designed Tyche, a network protocol for converged storage that bundles multiple 10GigE links transparently and reduces protocol overheads over raw Ethernet without hardware support. However, current technology trends and server consolidation dictates building servers with large amounts of resources (CPU, memory, network, storage). Such servers need to employ Non-Uniform Memory Architectures (NUMA) to scale memory performance. NUMA introduces significant problems with the placement of data and buffers at all software levels. In this paper, we first use Tyche to examine the performance implications of NUMA servers on end-to-end network storage performance. Our results show that NUMA effects have significant negative impact and can reduce throughput by almost 2x on servers with as few as 8 cores (16 hyper-threads). Then, we propose extensions to network protocols that can mitigate this impact. We use information about the location of data, cores, and NICs to properly align data transfers and minimize the impact of NUMA servers. Our design almost entirely eliminates NUMA effects by encapsulating all protocol structures to a “channel” concept and then carefully mapping channels and their resources to NICs and NUMA nodes.We thankfully acknowledge the support of the European Commission under the 7th Framework Programs through the NanoStreams (FP7-ICT-610509) project, the HiPEAC3 (FP7-ICT-287759) Network of Excellence, and the COST programme Action IC1305, ’Network for Sustainable Ultrascale Computing (NESUS)’

    D5.1: Accelerator Deployment Models

    Get PDF
    In this deliverable, we explore this question by studying accelerator deployment models. Under accelerator, we understand for example application-specific GPUs or specially programmed FPGAs. A deployment specifies types, amount, and connectivity of accelerators in a datacenter. With these definitions in mind, we created a theoretical model of the datacenter, its components, expected workloads, and finally, it is possible deployments. We have developed VineSim, a software simulator of a datacenter, based on the aforementioned theoretical modeling. VineSim takes as inputs a workload and a deployment description and outputs performance metrics of interest, such as job latency and resource utilization. In VineSim, one can configure several parameters, including how tasks are allocated to nodes, and estimations of how fast they execute on different accelerators. VineSim can be used to explore how different deployments respond to different kinds of workloads, thus allowing one to determine how to best compose a datacenter based on particular workload, performance, or budgeting requirements

    Arax: A Runtime Framework for Decoupling Applications from Heterogeneous Accelerators

    Full text link
    Today, using multiple heterogeneous accelerators efficiently from applications and high-level frameworks, such as TensorFlow and Caffe, poses significant challenges in three respects: (a) sharing accelerators, (b) allocating available resources elastically during application execution, and (c) reducing the required programming effort. In this paper, we present Arax, a runtime system that decouples applications from heterogeneous accelerators within a server. First, Arax maps application tasks dynamically to available resources, managing all required task state, memory allocations, and task dependencies. As a result, Arax can share accelerators across applications in a server and adjust the resources used by each application as load fluctuates over time. dditionally, Arax offers a simple API and includes Autotalk, a stub generator that automatically generates stub libraries for applications already written for specific accelerator types, such as NVIDIA GPUs. Consequently, Arax applications are written once without considering physical details, including the number and type of accelerators. Our results show that applications, such as Caffe, TensorFlow, and Rodinia, can run using Arax with minimum effort and low overhead compared to native execution, about 12% (geometric mean). Arax supports efficient accelerator sharing, by offering up to 20% improved execution times compared to NVIDIA MPS, which supports NVIDIA GPUs only. Arax can transparently provide elasticity, decreasing total application turn-around time by up to 2x compared to native execution without elasticity support

    CumuloNimbo: A highly-scalable transaction processing platform as a service.

    Get PDF
    One of the main challenges facing next generation Cloud platform services is the need to simultaneously achieve ease of programming, consistency, and high scalability. Big Data applications have so far focused on batch processing. The next step for Big Data is to move to the online world. This shift will raise the requirements for transactional guarantees. CumuloNimbo is a new EC-funded project led by Universidad Politécnica de Madrid (UPM) that addresses these issues via a highly scalable multi-tier transactional platform as a service (PaaS) that bridges the gap between OLTP and Big Data applications

    Using Lightweight Transactions and Snapshots for Fault-Tolerant Services Based on Shared Storage Bricks

    Full text link
    To satisfy current and future application needs in a cost effective manner, storage systems are evolving from mono-lithic disk arrays to networked storage architectures based on commodity components. So far, this architectural transi-tion has mostly been envisioned as a way to scale capacity and performance. In this work we examine how the block-level interface exported by such networked storage systems can be extended to deal with reliability. Our goals are: (a) At the design level, to examine how strong reliability se-mantics can be offered at the block level; (b) At the imple-mentation level, to examine the mechanisms required and how they may be provided in a modular and configurable manner. We first discuss how transactional-type semantics may be offered at the block level. We present a system design that uses the concept of atomic update intervals combined with existing, block-level locking and snapshot mechanisms, in contrast to the more common journaling techniques. We discuss in detail the design of the associated mechanisms and the trade-offs and challenges when dividing the re-quired functionality between the file-system and the block-level storage. Our approach is based on a unified and thus, non-redundant set of mechanisms for providing reliability both at the block and file level. Our design and imple-mentation effectively provide a tunable, lightweight trans-actions mechanism to higher system and application layers. Finally, we describe how the associated protocols can be implemented in a modular way in a prototype storage sys-tem we are currently building. As our system is currently being implemented, we do not present performance results

    Data management techniques

    Get PDF
    Today, it is projected that data storage and management is becoming one of the key challenges in order to achieve ultrascale computing for several reasons. First, data is expected to grow exponentially in the coming years and this progression will imply that disruptive technologies will be needed to store large amounts of data and more importantly to access it in a timely manner. Second, the improvement of computing elements and their scalability are shifting application execution from CPU bound to I/O bound. This creates additional challenges for significantly improving the access to data to keep with computation time and thus avoid high-performance computing (HPC) from being underutilized due to large periods of I/O activity. Third, the two initially separate worlds of HPC that mainly consisted on one hand of simulations that are CPU bound and on the other hand of analytics that mainly perform huge data scans to discover information and are I/O bound are blurring. Now, simulations and analytics need to work cooperatively and share the same I/O infrastructure

    The VINEYARD Approach: Versatile, Integrated, Accelerator-Based, Heterogeneous Data Centres.

    Get PDF
    Emerging web applications like cloud computing, Big Data and social networks have created the need for powerful centres hosting hundreds of thousands of servers. Currently, the data centres are based on general purpose processors that provide high flexibility buts lack the energy efficiency of customized accelerators. VINEYARD aims to develop an integrated platform for energy-efficient data centres based on new servers with novel, coarse-grain and fine-grain, programmable hardware accelerators. It will, also, build a high-level programming framework for allowing end-users to seamlessly utilize these accelerators in heterogeneous computing systems by employing typical data-centre programming frameworks (e.g. MapReduce, Storm, Spark, etc.). This programming framework will, further, allow the hardware accelerators to be swapped in and out of the heterogeneous infrastructure so as to offer high flexibility and energy efficiency. VINEYARD will foster the expansion of the soft-IP core industry, currently limited in the embedded systems, to the data-centre market. VINEYARD plans to demonstrate the advantages of its approach in three real use-cases (a) a bio-informatics application for high-accuracy brain modeling, (b) two critical financial applications, and (c) a big-data analysis application
    • …
    corecore